This is an introduction to Spatial Analysis in R. I’ve developed this code based on some common questions from friends and colleagues or ones that I’ve asked myself. There is a lot here to help you get started, but there is also a lot more to learn!
The focus here will be on raster analysis, rather than vector (shapefiles, polygons, polylines, points, etc.).
I’ve drafted this informal introductory session in the form of answering a scientific question….
Where are optimal sites for Sea Monkey aquaculture off the west coast?
We will answer this question by taking into consideration the following spatial data:
1. Sea Surface Temperature 2. Net Primary Productivity 3. Marine Protected Areas
Key information for optimal growth:
Raster or gridded data are stored as a grid of value which are rendered on a map as pixels. Each pixel value represents an area on the Earth’s surface.
Some examples of raster data include oceanographic datasets such as Sea Surface Temperature, land use maps and digital elevation maps.
Raster data can come in many different formats. In this tutorial, we will use the geotiff format which has the extension .tif. A .tif file stores metadata or attributes about the file as embedded tif tags. These tags can include the following raster metadata:
CRS)extent)NoDataValue)resolution of the dataInformation in this section is borrowed from NEON’s Intro to Raster Data in R tutorial, another great resource
There are a lot of spatial packages for R, we will touch on some of them here but not all of them. Here is brief overview, taken from this site:
raster package is also able to read NetCDF files and I prefer to use Raster whenever possible.Load all libraries:
For raster data visualization I prefer the spectral color scheme rather than the base graphics package. I’m also setting the plotting margins much smaller so that the plots will show up larger in the viewing pane.
My first step in a spatial analysis is prepping the data, which includes the following:
Read in a shapefile of the US West Coast and northern Baja peninsula by using readOGR from the rgdal package.
## OGR data source with driver: ESRI Shapefile
## Source: "data", layer: "ca_current"
## with 2 features
## It has 7 fields
Plotting a shapefile is just as easy as:
And to add land to the map, do the following (from the maps package)
The information in the summary of the shapefile is important if you need to understand what projection your SpatialPolygonsDataFrame is in, along with the extent and number of features. In this case there are just two features, the EEZs of the US West Coast and northern Baja Mexico. You can get this information by just typing in the name ‘cc’
## class : SpatialPolygonsDataFrame
## features : 2
## extent : -129.1635, -109.9721, 21.6185, 49.08721 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## variables : 7
## names : EEZ, Country, ID, Sovereign, Remarks, Sov_ID, Last_Chang
## min values : Mexican Exclusive Economic Zone, Mexico, 135, Mexico, NA, 135, 30/03/2006
## max values : United States Exclusive Economic Zone, United States, 163, United States, NA, 163, 30/03/2006
You can look at the data held within a shapefile by calling cc@data. The dataframe associated with the shapefile can be treated just as any other dataframe. Columns can be added, names can be changed, tables can be joined.
## EEZ Country ID Sovereign
## 0 United States Exclusive Economic Zone United States 163 United States
## 1 Mexican Exclusive Economic Zone Mexico 135 Mexico
## Remarks Sov_ID Last_Chang
## 0 <NA> 163 30/03/2006
## 1 <NA> 135 30/03/2006
## EEZ Country ID Sovereign
## 0 United States Exclusive Economic Zone United States 163 United States
## 1 Mexican Exclusive Economic Zone Mexico 135 Mexico
## Remarks Sov_ID Last_Chang short
## 0 <NA> 163 30/03/2006 USA
## 1 <NA> 135 30/03/2006 MEX
Now that we have our boundary area defined by the shapefile, we can start prepping the raster data.
In the Data folder, there are 5 .tif files with the naming pattern average_annual_sst_[year].tif, which are 5 annual average sea surface temperatures for our region (2008-2012). We want just one raster file of the average SST over that time period.
To create a single average Sea Surface Temperature layer, do the following:
## [1] "data/average_annual_sst_2008.tif" "data/average_annual_sst_2009.tif"
## [3] "data/average_annual_sst_2010.tif" "data/average_annual_sst_2011.tif"
## [5] "data/average_annual_sst_2012.tif"
Create a raster of the first file by calling raster() and then plot() to visualize.
Notice the data values are in Kelvin - we will change this to celsius later.
You can plot rasters or shapefiles on top of each other
I also like to look at the distribution of data. Using the histogram() function from rasterVis is my preference over hist() from the base package purely because of the visual output.
To get a single layer of average SST in degrees Celsius we need to first stack all layers.
You can perform operations on a RasterStack by using the calc() function from the raster package. calc() lets you define a function to apply across all layers in the stack.
Calculate the mean value per cell and then convert to Celsius by subtracting 273.15.
A more compact way of doing multiple raster analysis is by using pipes…you can run stack() and calc() in one call!
Read in this data the same way as the SST data, using raster(). This data is the net primary production (mgC/m2/day).
## class : RasterLayer
## dimensions : 145, 114, 16530 (nrow, ncol, ncell)
## resolution : 112000, 54700 (x, y)
## extent : -17813502, -5045502, 1526148, 9457648 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
## data source : C:\Users\Afflerbach\Documents\github\spatial-analysis-R\data\annual_npp.tif
## names : annual_npp
## values : 1.898059, 3.783287 (min, max)
You’ll see that this is in a different projection, extent and cell size from the SST data. It is really obvious when you look at the plot, but the summary above also gives clues as to what projection/resolution/extent this data is in.
To do any sort of analysis using multiple rasters, they all need to be in the same extent, projection and cell resolution.
First look at the differences:
## class : RasterLayer
## dimensions : 720, 648, 466560 (nrow, ncol, ncell)
## resolution : 0.04166185, 0.04165702 (x, y)
## extent : -131.9848, -104.9879, 19.99537, 49.98842 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +ellps=WGS84 +no_defs
## data source : in memory
## names : layer
## values : -1.57, 38.24 (min, max)
## class : RasterLayer
## dimensions : 145, 114, 16530 (nrow, ncol, ncell)
## resolution : 112000, 54700 (x, y)
## extent : -17813502, -5045502, 1526148, 9457648 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
## data source : C:\Users\Afflerbach\Documents\github\spatial-analysis-R\data\annual_npp.tif
## names : annual_npp
## values : 1.898059, 3.783287 (min, max)
To get the primary productivity data in the same format as the SST data, we need to
reprojectcropresampleUse projectRaster() from the raster package to reproject a RasterLayer from one projection to another. You will need to define what the new projection should be by setting a coordinate reference system.
Defining a coordinate reference system (crs) can be done in many ways. See Melanie’s great cheat sheet for more details about Coordinate Reference Systems.
Here, we want to project from Mollweide to longlat
Now that the layer is in the right projection, we need to crop it to our study area and make sure all raster layers have the same extent.
You can set the extent using extent(). You can also use another raster object to define the extent (in this case we will use sstAvg)
Just by plotting both the SST and OA data, you can tell right away that these two datasets have different cell resolutions. The NPP data needs to be resampled to the same cell size as SST in order to do any sort of analysis on these two. Use the nearest neighbor method to avoid interpolation between points. We want to keep the data values the same as the original data, but at a higher resolution.
Here you can see the difference:
## null device
## 1
NOTE: Typically you’ll want to disaggregate cells to match data of a higher resolution. Otherwise, if we aggregate the cells from the SST data, we would lose data.
Again we can condense this script by using pipes!
Check to see that we can use the SST and NPP data together now
## class : RasterStack
## dimensions : 720, 648, 466560, 2 (nrow, ncol, ncell, nlayers)
## resolution : 0.04166185, 0.04165702 (x, y)
## extent : -131.9848, -104.9879, 19.99537, 49.98842 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +ellps=WGS84 +no_defs
## names : annual_npp, layer
## min values : 2.302361, -1.570000
## max values : 3.487666, 38.240000
Now that our data is prepped and guaranteed to play nicely, we can move onto the fun stuff - analyzing the data. For this specific analysis, we need to use the SST and NPP data to find areas within the cc region that are suitable for growing seamonkeys. This requires removal of all cells from NPP and SST that are not within the ideal growth parameter range.
Remove all cells from the Sea Surface Temperature layer that fall out of the species temperature range
Seamonkeys grow best in waters that are between 12 and 18 degrees Celsius.
Remembering that sstAvg is our current SST layer, you can eliminate all cells with values outside of your range in a few different ways.
I prefer to do a simple subset using brackets, but first assigning a new variable so you don’t overwrite sstAvg
Remove all cells from the Ocean Acidification layer that fall out of the species preferred range
Our species also prefers water with a mean primary production of between 2.6 and 3 mgC/m2/day
Now that we have our two rasters with suitable parameters, there are a lot of different ways to select overlapping cells. I like to do this in a binary way all suitable cells set equal to 1, the rest NA or 0.
Set all viable cells in SST to equal 1
…and same for NPP
Now that we have these two binary layers, we can combine them using overlay() from the raster package and the resulting cells, equal to 1, are the cells that meet both SST and NPP requirements ***
SIDE NOTE:
You can perform mathematical operations on single or multiple raster layers using base R functions. Both calc() and overlay() are useful when you have complex functions working on these layers. Here are some examples of how you can use raster layers with base R functions:
***
This could be all you need - but I want to show some additional steps to highlight more functionality in R.
You can remove cells outside of your region by using the mask() function. Here, you only want to keep cells that are within the Mexican or US EEZs.
You can then crop your raster to a smaller extent
Another nifty thing - if you don’t know the extent, you can draw it and R will tell you! You can then save this extent and crop to it
There are two shapefiles in the data folder. One is ‘MPA_State’ which are state MPAs, and one is ‘National Marine Sanctuaries’. To go along with the aquaculture analysis, lets say that cells within these regions must be excluded from the suitability analysis.
## class : SpatialPolygonsDataFrame
## features : 101
## extent : -373440.2, 259220.6, -590425.8, 259522.3 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0
## variables : 7
## names : NAME, Type, Length_nmi, CCR, Area_nmi, Perimeter_, present
## min values : Año Nuevo, SMCA, 0.00, Section 632 (b)1, 0.01000000, 0.560000, 1
## max values : White Rock (Cambria), Special Closure, 15.36, Yet to be published, 24.79270395, 31.336239, 1
## NULL
Notice that when creaing mpa_c there is no error. But when you call it the output is NULL, indicating a NULL object. This is because the MPA shapefile is not in the same projection
Using spTransform() from the rgdal package, you can project a shapefile in a similar manner as raster.
The same can be done with the National Marine Sanctuary shapefile, using pipes!
To remove cells in the MPAs and NMS just use the mask() function but set inverse=T so that all cells outside of the polygons are kept and those inside are set to NA.
With any raster analysis, you likely aren’t just creating a pretty map. Here is an example of running zonal statistics on a raster.
We want to look at the total area (km2) in Mexico and California for seamonkey aquaculture.
First you want to turn a shapefile (california current) into a raster of the same projection/extent/resolution as your cells.
## EEZ Country ID Sovereign
## 0 United States Exclusive Economic Zone United States 163 United States
## 1 Mexican Exclusive Economic Zone Mexico 135 Mexico
## Remarks Sov_ID Last_Chang short
## 0 <NA> 163 30/03/2006 USA
## 1 <NA> 135 30/03/2006 MEX
## class : RasterLayer
## dimensions : 180, 321, 57780 (nrow, ncol, ncell)
## resolution : 0.04166185, 0.04165702 (x, y)
## extent : -126.7354, -113.3619, 27.49363, 34.99189 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : layer
## values : 1, 2 (min, max)
## attributes :
## ID EEZ Country ID.2 Sovereign
## 1 United States Exclusive Economic Zone United States 163 United States
## 2 Mexican Exclusive Economic Zone Mexico 135 Mexico
## Remarks Sov_ID Last_Chang short
## <NA> 163 30/03/2006 USA
## <NA> 135 30/03/2006 MEX
R automatically assigned 1 and 2 as the cell values. You can set your own values too based on a field or another defined vector of ids.
## class : RasterLayer
## dimensions : 180, 321, 57780 (nrow, ncol, ncell)
## resolution : 0.04166185, 0.04165702 (x, y)
## extent : -126.7354, -113.3619, 27.49363, 34.99189 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : layer
## values : 135, 163 (min, max)
To get the total viable area for aquaculture of seamonkeys in the California Current, run zonal() using ccZones. The zonal() function is given any sort of function you define including sum, mean, max, etc.
Since the current values of prefRas are all equal to 1, we can simply sum up the number of cells in each EEZ.
## zone sum
## [1,] 1 10663
## [2,] 2 4725
## null device
## 1
But that isn’t as useful as calculating the actual area in km2. Using the area() function from the raster package you can create a new raster with cell values equal to their area, and then run zonal().
## zone sum
## [1,] 1 190867.6
## [2,] 2 87307.5